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Abstract 

Importance weighting is a convenient general way to adjust for draws from the 
wrong distribution, but the resulting ratio estimate can be noisy when the importance 
weights have a heavy right tail, as routinely occurs when there are aspects of the target 
distribution not well captured by the approximating distribution. More stable estimates 
can be obtained by truncating the importance ratios. Here we present a new method 
for stabilizing importance weights using a generalized Pareto distribution fit to the 
upper tail of the distribution of the simulated importance ratios. 
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1. Introduction 

Importance sampling is a simple correction that is used when we can more easily obtain 
samples from some approximating distribution than directly from the target distribution. 
Expectations with respect to the target distribution can be estimated by weighting the 
samples by the ratio of the densities. But when the approximating distribntion is narrower 
than the target distribution—or, more generally, when the approximating distribution is a 
poor fit—the distribution of importance ratios can have a heavy right tail, which can lead to 
unstable importance weighted estimates, sometimes with inhnite variance. 

lonides (2008) introduced a truncation scheme for importance ratios in which the trun¬ 
cation point depends on the number of simulation draws so that the resulting importance- 
weighted estimates have hnite variance and are simulation consistent. In this paper we 
take the trnncation scheme of lonides and add to it the idea of htting a generalized Pareto 
distribution to the right tail of the distribution of importance ratios. Our method, which we 
call Pareto smoothed importance sampling (PSIS), not only reduces bias but also provides a 
natural diagnostic for ganging the reliability of the estimate. 

After presenting some background material in Section 2, we present our proposed method 
in Section 3, toy examples in Section 4, and pratical examples in Section 5, concluding in 
Section 6 with a brief discussion. Some of the content in Sections 2 and 3 is covered in 
Vehtari et al. (2016b) in relation to approximate leave-one-out cross-validation. Here we 
present it again in more general form, supplemented with additional details and more general 
commentary not tied to any specihc application of the algorithm. 
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Alfred P. Sloan Foundation, U.S. National Science Foundation, Institute for Education Sciences, and Ofhce of 
Naval Research for partial support of this research. 
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2. Importance sampling 

Suppose we want to estimate an integral, 


J h{e)p{9)d9, 


( 1 ) 


where h{9) is a function and p{9) is a probability density which we cannot easily or cheaply 
draw from directly. Instead there is an approximating density g{9) from which we can easily 
generate random draws. The integral (1) can be rewritten, 

( h(p\r.(p\ - 9{0)d9 

J ^ fm/msmo • 

and it can then be estimated using S draws 9^,... ,9^ from g{9) by computing 


where the factors 
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are called importance ratios. 

If p is a normalized probability density, the denominator of (2) is 1. However, in general p 
might only be known up to a normalizing constant, as is common in Bayesian inference where 
p might represent the posterior density of interest (with the dependence on data suppressed 
in our notation). It is therefore standard to use the ratio estimate (3), for which only the 
relative values of the importance ratios are needed. 

In Bayesian analysis, proposal distributions g are often recommended based on simple 
approximations, for example, normal, split normal, or split-t fit at the mode (Geweke, 1989), 
or mixtures of multivariate normals, or approximate distributions obtained by variational 
inference or expectation propagation. Another application of importance sampling is for 
leave-one-out cross-validation, in which case the approximating distribution g is the full 
posterior and the target distribution p is the cross-validated posterior, excluding the likelihood 
for one observation (Gelfand et ah, 1992; Gelfand, 1996), with the entire computation repeated 
for each data point, hence the need for quick computations. 

2.1. From importance ratios to importance weights 

Geweke (1989) shows that if the variance of the distribution of importance ratios is hnite, 
then the central limit theorem holds for the convergence of the estimate in (3). Chen and 
Shao (2004) show further that the rate of convergence to normality is faster when higher 
moments exist. In simple cases, the existence of the variance and higher moments can be 
checked analytically (Peruggia, 1997; Epifani et ah, 2008; Robert and Casella, 2004; Pitt et ah. 
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2013). In general, however, we wonld like a procednre that works based on the importance 
ratios alone, withont reqniring additional analysis of the distribntions. 

If the variance of the distribution of importance ratios does not exist, then in general 
the importance weighted estimate cannot be trusted, and it makes sense to replace the 
importance ratios by more stable weights. Thus, (3) is replaced by 

I Ell h{e^)w{d^) 

1 ^ 5 —( 5 ) 

where the importance weights w are some function of the importance ratios r from (4). Two 
extremes are w oc r (raw importance weighting) and tc oc 1 (identity weights, equivalent to 
just using the approximating distribution g). 

Approaches that stabilize the weights can be interpreted as compromising between an 
unstable estimate of p and a stable computation using g. It would be tempting to characterize 
this as a bias-variance tradeoff, but that would not be quite correct because raw importance 
weighting (3) is a ratio estimate and thus is itself biased, and this bias can be considerable if 
the distribution of the importance ratios is long-tailed. 

2.2. Truncated importance sampling 

Truncatated importance sampling is the same as standard importance sampling but using 
weights obtained by truncating the raw ratios. lonides (2008) proposes a scheme in which the 
truncation point depends on the sample size S, and each individual weight Wg is obtained 
from the corresponding ratio by taking 

Wg = min (^rg,'/Sr^, (6) 

where f is the average of the original S importance ratios. lonides (2008) not only proves 
that the distribution of these weights is guaranteed to have hnite variance, but also shows 
that the resulting importance sampling estimate of interest has a mean square error close to 
that of an estimate obtained using a case-specihc optimal truncation point. Unfortunately, 
while this truncation method can greatly improve stability, it comes at the expense of bias. 
And, as our examples will demonstrate, this bias can be large. 


2.3. Sample based estimate using the generalized Pareto distribution 


To make a sample based estimate of the existing moments, Koopman et ah (2009) suggest 
analyzing the properties of a generalized Pareto distribution ht to the upper tail of the 
distribution of the importance ratios r{6^). Pickands (1975) proves that, if the unknown 
distribution function lies in the domain of attraction of some extremal distribution function, 
then, as the sample size increases and a threshold for the tail is allowed to increase, the upper 
tail of an unknown distribution is well approximated by the three-parameter generalized 
Pareto distribution. 


p{y\u,a, k) 




( 7 ) 
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where m is a lower bound parameter, y is restricted to the range (m, oo), a is a scale parameter, 
and /c is a shape parameter. 

Koopman et al. (2009) set the lower bound u to the chosen threshold and use a maximum 
likelihood estimate for [a, k). Because the generalized Pareto distribution has the property 
that when k > 0 the number of existing moments is less than they then form a 

statistical test of the hypothesis k < 1/2, from which an inference can be drawn about 
whether the underlying distribution has a finite variance. 

Pickands (1975) notes that to obtain asymptotic consistency, the threshold u should be 
chosen so that the sample size M in the tail should increase to inhnity while the proportion 
of simulation draws in the tail, M/S, goes to zero. 

Zhang and Stephens (2009) propose a quick empirical Bayes flavored estimation method 
for the parameters of the generalized Pareto distribution, which has lower bias and higher 
efficiency than the maximum likelihood estimate. For this method, the distribution is 
reparameterized as {b, k), where b = k/a. The parameters b and k are highly correlated and 
the likelihood is replaced by a prohle likelihood where k is chosen to maximize the likelihood 
given b. The prohle likelihood is combined with a weakly informative prior for b and the 
posterior mean b is computed numerically. Finally k is obtained by maximizing the likelihood 
given b, and a is set to k/b. 

Zhang and Stephens (2009) show that this estimate has a small bias, is highly efficient, 
and is both simple and fast to compute. It is likely that an even better estimate could be 
obtained from a fully Bayesian approach, but since speed is important in many applications 
we use their fast estimate for now. 

3. Pareto smoothed importance sampling 

We propose a novel importance sampling estimate that has the hnite-variance property of 
truncated importance sampling while also reducing bias by htting the generalized Pareto 
distribution to the upper tail of the weight distribution. 

Generalized Pareto distribution fit to the tail. We start by htting a generalized 
Pareto distribution (7) to the importance weight values above a threshold u. As mentioned 
in Section 2.3, the shape parameter k of the generalized Pareto distribution can be used 
to characterize the thickness of the of the tail of the importance weight distribution and 
determine the existence of moments. 

• \i k < \ then the distribution of importance ratios has hnite variance. In this case the 
central limit theorem holds and we can observe fast convergence of the estimate. Given 
reasonably large S, the estimate is not sensitive to the largest weight values. 

• If ^ < /c < 1 then the variance is inhnite, but the mean exists. As the generalized 
central limit theorem for stable distributions holds, the distribution of the estimate 
converges to a stable distribution, although the convergence of the raw importance 
sampling estimate is slower and the estimate is sensitive to rare large weights in the 
long tail. 
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• \i k >1 then neither the variance nor the mean exists. 

In our experiments we choose the threshold u so that proportion of the samples in the tail, 
M/S, is between 10% and 20%, depending on how large S is. As the value of M increases, 
the uncertainty in the estimate shrinks, but the bias is reduced when the ratio M/S is smaller. 
The experimental results were not sensitive to the exact value of u. 


Smoothing of the weights using the generalized Pareto distribution. We stabilize 
the importance weights by replacing the M largest weights above the threshold u by the 
expected values of the order statistics of the htted generalized Pareto distribution 


1-1 


z-1/2 

M 


z = 1, 




where F ^ is the inverse-CDF of the generalized Pareto distribution. This reduces the 
variation in the largest weights, and thus typically reduces the variance of the importance 
sampling estimate. As the largest weight is F~^ weights are truncated and the 

variance of the estimate is hnite. Compared to the simple truncation by lonides (2008), the 
bias is reduced as the largest weights are not all truncated to the same value, but rather 
spread according to the estimated tail shape. 


Truncation of very large weights. When the shape parameter k is close to or larger 
than 1, small changes in the estimate of k have a big effect on the largest values of the 
inverse-CDF. To reduce this variation, we use an additional truncation of the largest weights 
despite the potential for increased the bias. Since the large weights have been smoothed, we 
need less truncation than in the truncated importance sampling by lonides (2008). Based on 
simulations we recommend truncation with S^^^w, where w is the average of the smoothed 
weights. 


Diagnostics. Previous research has focused on identifying whether the variance is hnite or 
inhnite (Peruggia, 1997; Epifani et ah, 2008; Koopman et al., 2009), but we demonstrate in 
Section 4that it can often be more useful to look at the continuous k values than the discrete 
number of moments. Based on our experiments, Pareto smoothed importance sampling 
is able to provide estimates with relatively small variance and bias when k < 0.7. With 
/c > 0.7 the estimator can have high variance and high bias, and if the method is implemented 
in a software package we recommend reporting to the user if /c > 0.7. Depending on the 
application a lower threshold could also be used. 


Summary of method. Given importance ratios Vg, s = 1,..., S, our method proceeds as 
follows. 

1. Set M = 0.25* and set u to the 80th percentile of the values of r^. 

2. Fit the generalized Pareto distribution (7) to the sample consisting of the M highest 
importance ratios, with the lower bound parameter u as just chosen, estimating the 
parameters k and a using the method from Zhang and Stephens (2009). 
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3. Replace the M highest importance ratios by the expected values of their order statistics 
of the generalized Pareto distribution given the estimated parameters from the previous 
step. The values below u are unchanged. We now refer to the S values as weights and 
label them Wg, s = 1,..., S. 

4. Truncate the weights at the value where w = ^ is the average weight 

value. 

5. If the estimated shape parameter k exceeds 0.5 (or 0.7, as suggested above), report a 
warning that the resulting importance sampling estimates might be unstable. 

This method has been implemented in an R function called psislw which is included in the 
loo R package (Vehtari et ah, 2016a). The package is available from CRAN and the source 
code can be found at https://github.com/stan-dev/loo. Python and Matlab/Octave 
implementations are available at https://github.com/avehtari/PSIS. 

4. Toy examples 

In the following toy examples we know the true target value and vary the proposal distribution. 
This allows us to study how k works as diagnostic and how bad the approximate distribution 
has to be before the importance sampling estimates break down. In each of the examples 
we simulate S = 16000 draws from the proposal distribution. We repeated the experiments 
with S = 4000 and S = 1000 and obtained similar resnlts, bnt the differences between the 
methods were clearer with larger S. 

4.1. Exponential, approximating distribution is too narrow 

In the first toy example, we demonstrate why it is usefnl to look at the continnons k valne, 
instead of discrete number of moments. The proposal distribntion is exponential with 
rate 1 and the target distribntion is exponential with varying valne of the rate parameter, 
6 = 1.3,1.5,1.9,2.0, 2.1, 3.0, or 10.0. The target fnnction is h = 6, that is, we are estimating 
the mean of the target distribution. In this simple case it is possible to compute analytically 
that the variance is finite only if 0 < 2 (Robert and Casella, 2004). 

Figure 1 shows the comparison of regular importance sampling (IS), truncated importance 
sampling (TIS), and Pareto smoothed importance sampling (PSIS). The vertical axis is the 
ratio of the estimated mean to the true mean (i.e., values close to 1 are good). For each case 
the mean of the estimated Pareto shape valnes k is shown along the horizontal axis. The 
figure confirms that when the variance is finite the errors are smaller, but the Pareto shape 
value k gives additional information abont the distribntion of errors in both the hnite and 
inhnite variance cases (since we are nsing a hnite nnmber of samples, the sample variance 
is hnite). Compared to the other methods, Pareto smoothing rednces the variance of the 
estimate when k < 1/2; PSIS has lower variance than both IS (6-80%) and TIS (6-40%). 
Using PSIS we can get low variance estimates also when k > 1/2 (and IS now has inhnite 
variance). Also notable is that the bias of PSIS is relatively small for k less than about 0.7, 
and the bias of TIS is larger than the bias of PSIS in all cases. The resnlts of the other 
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Figure 1; Toy example 1: Violin plot for regular importance sampling (blue), truncated 
importance sampling (red), and Pareto smoothed importance sampling (yellow) estimates 
ofE{9)/9o in 10000 repeated simulations with different values of 9 q, in which the proposal 
distribution is exponential with unit rate, and the target distribution is exponential with 
different choices of rate parameter. The true mean in this example is 1 in each case. The 
vertical axis has been truncated to 2.8, cutting off two simulations (out of 10000) where the 
estimate was up to 5 times larger than the true value. 


experiments reported below show similar behavior of the compared methods with respect to 
the Pareto shape value k. 

4.2. Univariate normal, approximating distribution is too narrow. 

Our next example is the same as the example used by lonides (2008) to demonstrate the 
performance of TIS when the proposal distribution is narrower than the target. The target 
distribution is p{9) = N{9 \ 0,1) and the proposal is g{9) = N{9 \ 0, cr^), with simulations 
performed for a = 0.1,0.2,..., 0.8. In this and all subsequent examples the target function 
is h = I; that is, we are implicitly estimating the ratio of the normalizing constants of the 
target and approximate distributions. 

We plot the estimate E{h) on the log scale as it improves the clarity of the hgures. The 
true value for the logarithm of the integral is 0. Figure 2 shows comparison of IS, TIS, and 
PSIS for the second toy example. TIS and PSIS have much smaller variance than IS, and 
PSIS has a slightly smaller bias than TIS. All methods fail if the proposal distribution is much 
narrower than the target distribution. Figure 3 shows the PSIS estimate of E{h) versus the 
estimated tail shape k over 1000 repeated simulations. Both the variance and bias increase 
considerably after k > 1/2. 

4.3. Univariate t, approximating distribution is shifted and slightly too narrow. 

In the previous example, the means of the target and the proposal distributions were the 
same. Next we demonstrate the performance of the various methods when the mean of the 
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Figure 2: Toy example 2: Violin plot for regular importanee sampling (blue), truncated impor¬ 
tance sampling (red), and Pareto smoothed importance sampling (yellow) estimates of logE{h) 
in 1000 repeated simulations with different values of a E (0.1, 0.2, 0.3,0.4, 0.5, 0.6, 0.7, 0.8). 
The true value in this example is 0 in each case. 



Figure 3: Toy example 2: Scatterplot for Pareto smoothed importance sampling estimates of 
E{h) versus the estimated tail shape k in 1000 repeated simulations. The different colors sep¬ 
arate the points by value of a E (0.1, 0.2, 0.3, 0.4,0.5, 0.6, 0.7, 0.8). Larger k values correspond 
to smaller values of a. 













4 



J__^^__L 

0 0.5 1 1.5 2 2.5 3 


Figure 4: Toy example 3: Violin plot for raw importance sampling (blue), truncated importance 
sampling (red), and Pareto smoothed importance sampling (yellow) estimates ofE{h) in 1000 
repeated simulations with the shift parameter /i set to the increasingly challenging values, 
0, 0.5,..., 2.5,3. The true value for log E{y) is 0 here, as in our other toy examples. 

proposal distribution is varied. The target is p{d) = t2i{0 \ 0,1) and the proposal distribution 
is g{6) = t2i{0 I /r, 1 — ^), with simulations performed for /r = 0, 0.5,..., 2.5, 3.0. We use 
the t distribution in this simulation to better match the distributions we might encounter 
in Bayesian inference, and the use of 1 — ^ represents the scenario from leave-one-out 
cross-validation in which the approximating distribution includes all the data and thus tends 
to have slightly lower variance than the target, which excludes one of the factors from the 
likelihood. 

Figure 4 shows the comparison of the methods for the third toy example. TIS and PSIS 
have much smaller variance than IS, and PSIS has much smaller bias than TIS. All methods 
eventually fail when the proposal distribution is far away from the essential mass of the target 
distribution. Figure 5 shows the variation in the estimated tail shape parameter k and the 
PSIS estimate. It can again be seen that the variance and bias increase after k > 1/2, but 
even if fc > 1 the bias from PSIS is still small in this example. 

4.4. Multivariate normal, approximating distribution is shifted and slightly too narrow. 

In our final toy example we compare the performance of the various importance sampling 
methods as we increase the number of dimensions of the vector 6. The target distribution is 
p{6) = N{6 \ 0,1) and the proposal is g{6) = t2i{0 \ 0.4 x 1,0.8/), and we examine performance 
in one dimension and then for dimensions p = 20,40,..., 100. As with our other examples 
we have purposely chosen an approximating distribution that is narrower than the target so 
as to make the problem more difficult. 

Figure 6 shows the sampling distributions of the estimated integral for the three methods, 
for each value of p. Again PSIS and TIS have much smaller variance than raw importance 
sampling, and PSIS has much smaller bias than TIS. All methods eventually fail as the 
number of dimensions increases and even a small difference in the distributions is amplified. 
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k 

Figure 5: Toy example 3: Scatterplot for Pareto smoothed importanee sampling estimates 
of E{h) versus estimated tail shape k in 1000 repeated simulations, with different colors 
corresponding to the shift parameter jx set to 0, 0.5,..., 2.5, 3. 



Number of dimensions p 


Figure 6: Toy example f: Violin plot for raw importance sampling (blue), truncated importance 
sampling (red), and Pareto smoothed importance sampling (yellow) estimates ofE{h) in 1000 
repeated simulations (each based on 16,000 draws from the approximate density). We make 
the conditions increasingly more challenging by setting the dimensionality parameter first to 
1, then to 20, fO, ..., 100. 
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Figure 7; Toy example 4- Scatterplot for Pareto smoothed importanee sampling estimates 
of E(/i) versus estimated tail shape k in 1000 repeated simulations with different colors 
corresponding to the numbers of dimensions p G (1, 20,40,60, 80,100). 


Figure 7 shows the variation in the estimated tail shape parameter k and in the PSIS estimate. 
The variance and bias increases after k > 1/2, but with PSIS it is possible to get only a small 
bias even when k is a bit larger than 1. 

5. Practical examples 

In this section we present two practical examples where the Pareto shape estimate /c is a 
useful diagnostic and Pareto smoothed importance sampling clearly improves the estimates. 
In the first example PSIS is used to improve distributional approximation (split-normal) of 
the posterior of logistic Gaussian process density estimation model. In the second example 
PSIS is used for fast and reliable approximate leave-one-out cross-validation. 

5.1. Improving distributional posterior approximation with importance sampling 

For computational efficiency, in Bayesian inference posterior distributions are sometimes 
approximated using simpler parametric distributions. Typically these approximations can 
be further improved by using the distributional approximation as a proposal distribution in 
importance sampling. 

Here we demonstrate the beneht of using PSIS for improving the Laplace approximation 
of a logistic Gaussian process (LGP) for density estimation (Riihimaki and Vehtari, 2014). 
LGP provides a flexible way to dehne the smoothness properties of density estimates via the 
prior covariance structure, but the computation is analytically intractable. Riihimaki and 
Vehtari (2014) propose a fast computation using discretization of the normalization term and 
Laplace’s method for integration over the latent values. 

Given n independently drawn d-dimensional data points Xi,... from an unknown 
distribution in a hnite region V of we want to estimate the density p{x). To introduce 
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the constraints that the density is non-negative and that its integral over V is eqnal to 1 
Riihimaki and Vehtari (2014) employ the logistic density transform, 


_ exp(/(3;)) 

f^exp(f(s))ds' 


( 8 ) 


where / is an nnconstrained latent fnnction. To smooth the density estimates, a Ganssian 
process prior is set for /, which allows for assnmptions about the smoothness properties of 
the unknown density p to be expressed via the covariance structure of the GP prior. To make 
the computations feasible V is discretized into m subregions (or intervals if the problem is 
one-dimensional). Here we skip the details of the Laplace approximation and focus on the 
importance sampling. 

Riihimaki and Vehtari (2014) use importance sampling with a multivariate split Gaussian 
density as an approximation, following Geweke (1989). The approximation is based on the 
posterior mode and covariance, with the density adaptively scaled along principal component 
axes (in positive and negative directions separately) to better match the skewness of the 
target distribution (see also Villani and Larsson, 2006). To further improve the performance 
Riihimaki and Vehtari (2014) replace the discontinuous split Gaussian used by Geweke with 
a continuous version. 

Riihimaki and Vehtari (2014) use an ad hoc soft thresholding of the importance weights if 
the estimated effective sample size as defined by Kong et al. (1994) is less than a specified 
threshold. The Kong et al. (1994) estimate is based on the estimate of the variance of the 
weights and is not valid if the variance is infinite. Here we propose to use the Pareto shape 
parameter diagnostic instead and to use PSIS to stabilize the weights. 

We repeated the density estimation of Galaxy data set^ 4000 times with different random 
seeds. The model has 400 latent values, that is, the posterior is 400-dimensional, although due 
to a strong dependency imposed by the Gaussian process prior the effective dimensionality 
is smaller. Because of this it is sufficient that the split-normal is scaled only along the first 
50 principal component axes. We obtain 8000 draws from the proposal distribution. As a 
baseline we used Markov chain Monte Carlo. 

Figure 8 shows the log density estimate compared to MCMC. The plain split-normal 
without importance sampling performs significantly worse than MCMC. On average, the split- 
normal with importance sampling performs similarly to MCMC, but the computation takes 
only 1.3s compared to about half an hour with MCMC (a laptop with Intel Core i5-4300U 
CPU @ 1.90GHz X 4). However, IS sometimes produces worse results and occasionally even 
worse results than without IS. TIS works better but has one clearly deviating case. PSIS gives 
the most stable performance. In the experiment the average k was 0.52, indicating that the 
variance of the raw weights is infinite. For comparison, for a simple normal approximation 
without split scaling the average k was 0.58, illustrating that the k diagnostic can be also be 
used to evaluate the quality of the distributional approximation. 

The GPstuff toolbox (Vanhatalo et ah, 2013) implementing logistic Gaussian process 
density estimation now uses PSIS for diagnostics and stabilization (code available at https: 
//github. com/gstuff-dev/gpstuff). Another example of using PSIS to diagnose and 

https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/galaxies.html 
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Figure 8: Split-normal importance sampling posterior approximation for the logistie Gaussian 
process density estimate. Violin plot of split-normal without importance sampling (gray), 
regular importance sampling (blue), truncated importance sampling (red), and Pareto smoothed 
importance sampling (yellow) estimates of the log density in fOOO repeated simulations with 
different random seeds. The baseline is the log density computed with MCMC. 


stabilise importance sampling in Bayesian inference as part of an expectation propagation 
like algorithm can be found in Weber et al. (2016). 

5.2. Importance-sampling leave-one-out cross-validation 

We next demonstrate the use of Pareto smoothed importance sampling for leave-one-out 
(LOO) cross-validation approximation. Importance sampling LOO was proposed by Gelfand 
et al. (1992), but it has not been widely used as the estimate is unreliable if the weights have 
inhnite variance. For some simple models, such as linear and generalized linear models with 
specihc priors, it is possible to analytically compute the necessary and sufficient conditions 
for the variance of the importance weights in IS-LOO to be hnite (Peruggia, 1997; Epifani 
et ah, 2008), but this is not generally possible. 

We demonstrate the beneht of fast importance sampling leave-one-out cross-validation 
with the example of a model for the combined effect of microRNA and mRNA expression 
on protein expression. The data were published by Aure et al. (2015) and are publicly 
available; we used the preprocessed data as described by (Aittomaki, 2016). Protein, mRNA, 
and microRNA expression were measured from 283 breast cancer tumor samples and when 
predicting the protein expression the corresponding gene expression and 410 microRNA 
expressions were used. 

We assumed a multivariate linear model for the effects and used Stan (Stan Development 
Team, 2015) to fit the model. As there are more covariates (411) than observations (283) we 
use a sparsifying hierarchical shrinkage prior (Piironen and Vehtari, 2015). Initial analyses 
gave reason to suspect outlier observations; to verify this we compared Gaussian and Student-t 
observations models. 

For 4000 posterior draws, the computation for one gene and one model takes about 9 
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Figure 9; Importance-sampling leave-one-out cross-validation: Error in LOO log predictive 
density (compared to exact LOO) from regular importance sampling (blue circles) and Pareto 
smoothed importance sampling (red crosses), and corresponding k > 0.5 values. The plot on 
the left is for the Gaussian models and the plot on the right is for the Student-t models. 


minutes (desktop Intel Xeon CPU E3-1231 v3 @ 3.40GHz x 8), which is reasonable speed. 
For all 105 genes the computation then takes about 30 hours. Exact regular LOO for all 
models would take 125 days, and 10-fold cross-validation for all models would take about 5 
days. Pareto smoothed importance sampling LOO (PSIS-LOO) took less than one minute 
for all models. However, we do get several leave-one-out cases where k > 0.7. Figure 9 shows 
the k values and error for IS-LOO and PSIS-LOO compared to exact LOO when k > 0.5. 
We see that the magnitude the of errors increases as k increases and that the errors are small 
for k < 0.7. For 0.5 < A: < 1, compared to IS-LOO, PSIS-LOO has a root mean square error 
38% smaller for Gaussian models and 27% smaller for Student-f models. 

As is it more sensitive to outliers, the Gaussian model has many high k values and thus 
LOO distributions can be highly different from full posterior distributions. To improve upon 
PSIS-LOO we can make the exact LOO computations for any points corresponding to k > 0.7. 
In this example there were 330 such cases for the Gaussian models and 80 for the Student-t 
models, and the computation for these took 42 hours. Although combining PSIS-LOO with 
exact LOO for certain points substantially increases the computation time in this example, it 
is still less than the time required for 10-fold-GV. In this case, reasonable results could also 
have been obtained by making the exact computation only when k > 1, which would have 
reduced the time to less than 6 hours for all models. 

Figure 10 shows the hnal results from PSIS-LOO-I- (PSIS-LOO with exact computation 
for cases with k > 0.7), comparing the Gaussian and Student-t model for each gene. It is 
easy to see that there are several genes with strong outlier observations in the data. The 
Student-t model is never worse and the estimated effects of microRNAs using the Gaussian 
model are likely to be bad for many of the genes in the data. 

The previously mentioned loo R package provides an implementation of PSIS-LOO and 
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Figure 10: Importance-sampling leave-one-out cross-validation: Violin plots for estimated log 
predictive density improvement of the Student-t model over the Gaussian model for all 105 
genes using PSIS-LOO-h. 

many examples are reported in Vehtari et al. (2016b), which focuses strictly on the use of 
PSIS for approximate leave-one-out cross-validation. 

6. Discussion 

Importance weighting is a widely used tool in statistical computation. Even in the modern 
era of Markov chain Monte Carlo, approximate algorithms are often necessary for big- 
data problems, and then there is a desire to adjust approximations to better match target 
distributions. It is well known that importance-weighted estimates are unstable if the weights 
have high variance. 

We have found that one can reduce both bias and variance of importance sampling 
estimates using a stabilizing transformation that we call Pareto smoothed importance sampling 
(PSIS) in which the highest weights are replaced by expected quantiles from a generalized 
Pareto distribution ht to the data (that is, to the weights from the available simulations). 
We believe this method will be helpful in many cases where importance sampling is used. 
In addition to the examples in this paper, PSIS has been used to stabilize importance 
sampling as a part of the complex algorithm in Weber et al. (2016), and we are currently 
investigating its use in particle filtering, adaptive importance sampling, and as a diagnostic 
for autodifferentiated variational inference (Kucukelbir et ah, 2014). 
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